I’ve mostly stuck to Google Maps API through ggmap, but I also uploaded a spreadsheet of 3,500 unique addresses to geocod.io to test their reverse geocoding. The output was nicer than ggmaps in that it has a confidence score and a census tract. It cost me $2.12. Let’s peak at the data:
# helper function
showTable <- . %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
geocodio <- vroom(here("data/geocodio_trial_run.csv.gz")) %>% clean_names()
geocodio %>%
select(1:5, city, census_tract_code) %>%
head() %>%
showTable()
| address | latitude | longitude | accuracy_score | accuracy_type | city | census_tract_code |
|---|---|---|---|---|---|---|
| 700 BLOCK EAST 10TH STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 500 BLOCK EAST 6TH STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 6TH STREET / TRINITY STREET Austin, TX | 30.26707 | -97.73933 | 0.80 | intersection | Austin | 001100 |
| 1200 BLOCK NORTH I H 35 Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 1500 BLOCK I H 35 Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 600 BLOCK RED RIVER STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
So a first sanity check would be to see if there are any addresses not geocoded to the city of Austin:
geocodio %>% distinct(city)
## # A tibble: 25 x 1
## city
## <chr>
## 1 Austin
## 2 Round Rock
## 3 San Antonio
## 4 Jarrell
## 5 Marble Falls
## 6 Pflugerville
## 7 Smithville
## 8 San Marcos
## 9 Cedar Park
## 10 Georgetown
## # … with 15 more rows
These are Texas cities, but they are definitely not suburbs of Austin. The austin_city_limits.shp file contains geometries for the city council districts in Austin. Let’s use a rough polygon of those districts to see how often geocod.io geocoded to a point outside of Austin:
# using the sf package here
austin_city_limits <-
st_read(here("data/austin_city_limits.shp")) %>%
st_union() %>%
st_convex_hull()
## Reading layer `austin_city_limits' from data source `/Users/nate/workspace/Austin Homelessness/data/austin_city_limits.shp' using driver `ESRI Shapefile'
## Simple feature collection with 199 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -98.01504 ymin: 30.04298 xmax: -97.48034 ymax: 30.51685
## geographic CRS: WGS84(DD)
# create an "sf" object which is just a dataframe with geometries attached
geocodio_geometry <- geocodio %$% map2(longitude, latitude, ~ st_point(c(.x, .y)))
geocodio %<>% st_sf(geometry = geocodio_geometry, crs = st_crs(austin_city_limits))
geocodio %<>% cbind(within_austin = st_intersects(geocodio, austin_city_limits, sparse = F))
# how many points were not in Austin?
geocodio %>%
count(within_austin) %>%
as_tibble() %>%
select(within_austin, n)
## # A tibble: 2 x 2
## within_austin n
## <lgl> <int>
## 1 FALSE 128
## 2 TRUE 3239
That doesn’t look too bad… about 4% of the geocoded addresses are outside of the city of Austin. I’m comfortable dropping them, but let’s take a closer look at the data and see if we can learn anything about the addresses that are getting geocoded to a location outside of the city of Austin.
# let's view a few of the points that are not in Austin
geocodio %>%
filter(!within_austin) %>%
select(address, accuracy_score, city, county) %>%
head(10) %>%
showTable()
| address | accuracy_score | city | county | geometry |
|---|---|---|---|---|
| 110 NORTH IH 35 SERVICE ROAD Austin, TX | 0.90 | Round Rock | Williamson County | POINT (-97.6872 30.5083) |
| 8TH AT COLORADO Austin, TX | 1.00 | San Antonio | Bexar County | POINT (-98.48153 29.43139) |
| N. IH35/WF Austin, TX | 0.68 | Jarrell | Williamson County | POINT (-97.60652 30.8247) |
| N. IH35 S/B/E. CESAR CHAVEZ Austin, TX | 0.68 | San Antonio | Bexar County | POINT (-98.47967 29.4381) |
| E. 11TH STREET/WF/IH35 Austin, TX | 0.76 | Marble Falls | Burnet County | POINT (-98.26873 30.57753) |
| W. 3RD STREET/LAVACA STREET Austin, TX | 0.80 | Smithville | Bastrop County | POINT (-97.16934 30.01062) |
| NORTH IH35 AND EAST 12TH STREET Austin, TX | 0.15 | San Marcos | Hays County | POINT (-97.92301 29.88173) |
| 12TH STREET/IH35 SB Austin, TX | 0.82 | San Antonio | Bexar County | POINT (-98.48095 29.43759) |
| 6th At Neches Street Austin, TX | 1.00 | Georgetown | Williamson County | POINT (-97.68081 30.63813) |
| 12th Street And Ih35 South Bound Austin, TX | 0.82 | San Antonio | Bexar County | POINT (-98.48095 29.43759) |
As a native Austinite I can locate these addresses within the city of Austin. Non-natives may not realize how wrong the given estimates are. Let’s plot the points to get a better sense:
# exclude anything from the plot that is obviously wrong (two points were located in Boston :( )
geocodio_minus_boston <- geocodio %>% filter(longitude < -90)
stamen_map <- st_bbox(geocodio_minus_boston) %>%
as.numeric() %>%
get_stamenmap()
ggmap(stamen_map) +
geom_sf(data = geocodio_minus_boston, inherit.aes = F, color = "#F54D97")
Let’s do exactly what we did with geocod.io and see how good Google Maps (via ggmaps) did. First, let’s peak at the shape of the data and then create sf POINT objects like we did before.
google_maps <- vroom(here("data/google_cache.csv.gz")) %>% clean_names()
google_maps %>%
head(5) %>%
showTable()
| google_search | lon | lat |
|---|---|---|
| 115 SANDRA MURAIDA WAY Austin, TX | -97.75483 | 30.26827 |
| 12600 BLOCK NORTH IH 35 Austin, TX | -97.71769 | 30.29847 |
| 800 BLOCK EMBASY DRIVE Austin, TX | -97.73227 | 30.26757 |
| 800 BLOCK ELM BERRY Austin, TX | -97.74475 | 30.26461 |
| WEST 5TH STREET and SOUTH MOPAC EXPRESSWAY Austin, TX | -97.76446 | 30.27393 |
# create an "sf" object which is just a dataframe with geometries attached
maps_geometry <- google_maps %$% map2(lon, lat, ~ st_point(c(.x, .y)))
google_maps %<>% st_sf(geometry = maps_geometry, crs = st_crs(austin_city_limits))
google_maps %<>% cbind(within_austin = st_intersects(google_maps, austin_city_limits, sparse = F))
# how many points were not in Austin?
google_maps %>%
count(within_austin) %>%
as_tibble() %>%
select(within_austin, n)
## # A tibble: 2 x 2
## within_austin n
## <lgl> <int>
## 1 FALSE 116
## 2 TRUE 3576
Peaking at the data for “Wrong City” points here would not be very informative since we just have lat and lon and no other identifying information. Let’s just skip straight to the plot:
ggmap(stamen_map) +
geom_sf(data = google_maps, inherit.aes = F, color = "#F54D97")
There were definitely errors that both APIs committed. From the maps it doesn’t look like they’re committing the “same type” of error, i.e. it looks like errors are clustered in differenct places (in San Antonio for geocod.io, and along I-35 for Google Maps). So perhaps we can conclude that in any one instance, one of the answers is “good enough”.
Let’s build a dataset that contains the distance between the two APIs for datapoints. That way, we can look at addresses where geocod.io and Google Maps really disagree.
citations_by_tract <- vroom(here("data/citations_by_tract.csv.gz"))
comparison_dataset <- citations_by_tract %>%
distinct(google_search) %>%
inner_join(google_maps, c("google_search")) %>%
inner_join(geocodio, c("google_search" = "address"))
distance <- ~ st_distance(.x, .y) %>%
units::set_units(mi) %>%
as.numeric()
comparison_dataset %<>% mutate(distance_miles = map2_dbl(geometry.x, geometry.y, distance))
comparison_dataset %>%
arrange(desc(distance_miles)) %>%
filter(distance_miles > 2) %>%
select(google_search, distance_miles, within_austin_google = within_austin.x, within_austin_geocodio = within_austin.y) %>%
showTable()
| google_search | distance_miles | within_austin_google | within_austin_geocodio |
|---|---|---|---|
| N CAPITAL TEXAS and 2222 Austin, TX | 29.297930 | TRUE | FALSE |
| CAP TEX and FM 2222 Austin, TX | 29.254947 | TRUE | FALSE |
| 100 Block North Bound I35 Austin, TX | 14.818215 | FALSE | TRUE |
| 13700 BLOCK OF US HIGHWAY 183 SERVICE and ROAD AT LAKE CREEK PARKWAY Austin, TX | 10.605390 | FALSE | TRUE |
| 13700 BLOCK NORTH US HIGHWAY 183 and SERVICE ROAD Austin, TX | 10.605390 | FALSE | TRUE |
| 800 Block Ih35 North Austin, TX | 9.330488 | FALSE | TRUE |
| E 7TH & N OH 35 SVRD NB Austin, TX | 8.878182 | FALSE | FALSE |
| 100 Block North Ih35 Austin, TX | 7.683405 | FALSE | TRUE |
| 100 Block North Ih35 West Frontage Austin, TX | 7.683405 | FALSE | TRUE |
| 100 Block North Ih35 Service Road Austin, TX | 7.683405 | FALSE | TRUE |
| 100 Block Ih 35 Austin, TX | 7.683405 | FALSE | TRUE |
| 100 BLOCK I H 35 Austin, TX | 7.683405 | FALSE | TRUE |
| 100 BLOCK NORTH IH35 SOUTH BOUND Austin, TX | 7.683405 | FALSE | TRUE |
| 100 Block Ih35 Austin, TX | 7.683405 | FALSE | TRUE |
| 100 Block North Ih35 South Bound Austin, TX | 7.683405 | FALSE | TRUE |
| 8TH & NECHES STREET Austin, TX | 7.198805 | FALSE | TRUE |
| 11655 NORTH US HIGHWAY 183 SERVICE ROAD Austin, TX | 4.317182 | FALSE | TRUE |
| 12781 NORTH U S HIGHWAY 183 SERVICE ROAD Austin, TX | 4.154996 | FALSE | TRUE |
| 900 Block Ih35 Southbound and Hertz Austin, TX | 3.363920 | FALSE | TRUE |
| 200 BLOCK IH 35 Austin, TX | 3.357029 | FALSE | TRUE |
| 701 ROBERT E. LEE ROAD Austin, TX | 2.959464 | FALSE | TRUE |
| 600 Block Sabine Street and Waller Creek Austin, TX | 2.460667 | FALSE | TRUE |
This table suggests a heuristic approach: For all the points with the biggest distance apart, we tend to see one API be “objectively right” in that it is inside Austin City Limits while the other is not.
I will take being inside Austin City Limits as a sufficient condition for accuracy, then. That means that, using Google Maps API as the initial source of truth,